Efficient Record Linkage using a Double Embedding Scheme
Abstract
Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined by domain-specific similarity functions over several attributes. This paper proposes a novel approach that uses two-level matching based on double embedding. First, records are embedded into a metric space of dimension K; they are then embedded into a space of smaller dimension K'. The first matching phase operates on the K'-dimensional vectors, performing a quick-and-dirty comparison that prunes a large number of true negatives while ensuring high recall. A more accurate matching phase is then performed in the K-dimensional space on the pairs that survive the first phase. Experiments on real data sets showed a gain in time performance ranging from 30% to 60% while achieving the same level of recall and accuracy as previous single-embedding schemes.
Keywords: data cleaning; similarity matching; record linkage; embedding schemes
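The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not say how either embedding is computed, so the sketch assumes records are already K-dimensional vectors, and it uses a hypothetical second embedding (`truncate`) that keeps only the first K' coordinates. The function names and the sample data are also assumptions made for illustration. Truncation never increases Euclidean distance, so a pair rejected in the coarse phase cannot be a match in the full space.

```python
import math
from itertools import product


def truncate(vec, k_small):
    """Hypothetical second embedding: keep the first k_small coordinates."""
    return vec[:k_small]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_level_match(left, right, k_small, threshold):
    """Return (candidates, matches) as lists of (left_index, right_index)."""
    left_small = [truncate(v, k_small) for v in left]
    right_small = [truncate(v, k_small) for v in right]

    # Phase 1: cheap comparison in the smaller space prunes distant pairs.
    candidates = [
        (i, j)
        for i, j in product(range(len(left)), range(len(right)))
        if euclidean(left_small[i], right_small[j]) <= threshold
    ]

    # Phase 2: exact comparison in the full space, on surviving pairs only.
    matches = [
        (i, j) for i, j in candidates
        if euclidean(left[i], right[j]) <= threshold
    ]
    return candidates, matches


# Made-up 4-dimensional vectors standing in for embedded records (K = 4).
left = [(0.0, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0)]
right = [(0.1, 0.0, 0.0, 0.1), (0.0, 0.0, 3.0, 0.0), (9.0, 9.0, 9.0, 9.0)]

candidates, matches = two_level_match(left, right, k_small=2, threshold=0.5)

# Brute-force matching in the full space, for comparison.
brute_force = [
    (i, j)
    for i, j in product(range(len(left)), range(len(right)))
    if euclidean(left[i], right[j]) <= 0.5
]
```

In this toy run the coarse phase keeps two of the six pairs, and the fine phase rejects the one that only looked close in the first two coordinates. The final matches agree with the brute-force result while the full-dimension distance is computed for fewer pairs.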
Similar resources
An improved and efficient stenographic scheme based on matrix embedding using BCH syndrome coding.
This paper presents a new stenographic scheme based on matrix embedding using BCH syndrome coding. The proposed method embeds a message into the cover by changing some coefficients of the cover. In this paper, defining a number as a characteristic of the syndrome, which is invariant with respect to the cyclic shift, we propose a new embedding algorithm based on BCH syndrome coding, without finding ro...
Attribute-based Access Control for Cloud-based Electronic Health Record (EHR) Systems
An electronic health record (EHR) system facilitates integrating patients' medical information and improves service productivity. However, user access to patient data in a privacy-preserving manner is still a challenging problem. Many studies are concerned with security and privacy in EHR systems. Rezaeibagha and Mu [1] have proposed a hybrid architecture for privacy-preserving accessing patient records...
Leveraging Social Media Signals for Record Linkage
Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities. Traditional approaches use the record representation of entities to accomplish this task. With the nascence of social media, entities on the Web are now a...
An efficient record linkage scheme using graphical analysis for identifier error detection
BACKGROUND Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone. METHODS We describ...
Steganography Scheme Based on Reed-Muller Code with Improving Payload and Ability to Retrieval of Destroyed Data for Digital Images
In this paper, a new steganography scheme with high embedding payload and good visual quality is presented. Before embedding process, secret information is encoded as block using Reed-Muller error correction code. After data encoding and embedding into the low-order bits of host image, modulus function is used to increase visual quality of stego image. Since the proposed method is able to embed...
Publication date: 2009